THE STEIN-DIRICHLET-MALLIAVIN METHOD


L. DECREUSEFOND 


Abstract. Stein's method is a popular tool for deriving upper bounds on distances between probability distributions. In some of its formulations, it can be viewed as an avatar of the semi-group method, or smart-path method, commonly used in Gaussian analysis. We show how this procedure can be enriched by Malliavin calculus, leading to a functional approach valid in infinite dimensional spaces.


1. Introduction 

Distances between probability measures, or probability metrics, are a very old topic with a wide range of applications. From the mathematical point of view, it is natural to define a metric topology on spaces of probability measures. From the modeling point of view, it is natural to compare the probability measures which appear in the mathematical representations of random phenomena. This topic has at least three facets: the diverse definitions of probability metrics, each tailored to a particular application; the computation and comparison of these different distances in the widest possible range of situations; and finally the applications, which go from mathematical considerations like functional inequalities to more practical results on the rate of convergence of stochastic algorithms. Figure 1 shows a partial view of the different aspects of this subject.



Figure 1. Mindmap 

A few words are in order to explain the blue and red colors. To compute the distance between two measures $\mu$ and $\nu$, we need to impose some relationship between them. Absolute continuity is one very frequent type of relationship between two measures. The Radon-Nikodym theorem then gives a precious tool to estimate divergence-like and Wasserstein distances (see for instance [15] for such an application). One may also reverse the point of view: given a positive function $F$, compare $\mu$ and $\nu = F\, d\mu$ to obtain some precious functional inequalities on $F$ (see [1]). These results thus belong to the same spirit and are colored in blue. Another natural way to put a structure between two measures is to have a map which transforms a known measure into another one, and to compare this transformed measure to a reference probability. This is exactly the framework in which Stein's method performs well if we consider Kantorovitch-Rubinstein type distances (defined below). Typical applications of this form of distance are convergence rates for celebrated results like the central limit theorem (CLT) or the Berry-Esseen theorem, or for random algorithms [25]. The links between these different points justify that they are all colored in red.

This paper is a rather informal introduction to the Stein-Dirichlet-Malliavin method (SDM for short henceforth). This is an extension of the classical Stein's method, enriched by the structure given by Dirichlet forms and Malliavin calculus. We hope that this new point of view will lead to more systematic proofs of convergence, extending their applicability. The price to pay is to master some new concepts from Malliavin calculus, like the gradient and its associated adjoint. That is why we tried to keep the technicalities at the lowest possible level, insisting more on the ideas at play.

We first review the different kinds of probability metrics that exist in the literature. We do not pretend to be exhaustive but aim to point out the wide diversity of possible definitions. In Section 3, we establish the principles of the SDM method and show how it can be applied to the Poisson-Gaussian convergence. We then explain how to construct the necessary structures to extend this procedure to infinite dimensional spaces. In Section 4, Edgeworth expansions are obtained by iterating the previous procedure as often as desired.

2. Taxonomy of probability metrics 

In what follows, all the probability measures are defined on Polish spaces denoted either by $\mathfrak{E}$ or $\mathfrak{F}$, whose Borel $\sigma$-fields are $\mathcal{B}(\mathfrak{E})$ and $\mathcal{B}(\mathfrak{F})$ respectively. There are several notions of metrics between probability measures. An interesting survey of the main variants and their mutual relationships can be found in [17]. Each of them is adapted to a particular purpose. They can roughly and partly be classified in three types. The first one is the so-called Prokhorov distance:

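$$\operatorname{dist}_{\mathrm{pro}}(P, Q) = \inf\left\{\epsilon > 0,\ P(A) \le Q(A^{\epsilon}) + \epsilon \text{ for all } A \in \mathcal{B}(\mathfrak{E})\right\},$$

where $A^{\epsilon}$ is the $\epsilon$-neighborhood of $A$ defined by $A^{\epsilon} = \{y \in \mathfrak{E},\ \exists x \in A,\ d(x, y) < \epsilon\}$.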
This distance is crucial as its associated topology is precisely the topology of the 
convergence in distribution, i.e. we have the following theorem which can be found 
in [13]. 

Theorem 1. A sequence $(P_n, n \ge 1)$ of probability measures converges weakly to $P$ if and only if $\operatorname{dist}_{\mathrm{pro}}(P_n, P)$ tends to 0 as $n$ goes to $\infty$.

Unfortunately, this distance is hardly computable, which justifies the search for alternative and more tractable definitions. A vast category of probability metrics is given by the $f$-divergences, defined as follows.

Definition 1. Let $f$ be a convex function such that $f(1) = 0$. Then, for two probability measures $P$ and $Q$ on a Polish space $\mathfrak{E}$,

$$D_f(P, Q) = \begin{cases} \displaystyle\int_{\mathfrak{E}} f\left(\frac{dP}{dQ}\right) dQ & \text{if } P \ll Q, \\ \infty & \text{otherwise.} \end{cases}$$

For instance, if we choose $f(t) = t \ln t$, we obtain the Kullback-Leibler distance. The Hellinger distance corresponds to the case where $f(t) = (\sqrt{t} - 1)^2$. The total variation between absolutely continuous measures boils down to taking $f(t) = |t - 1|$.

Another class of distances between measures can be obtained via optimal transportation theory. For general results about this theory, we refer to the books [24, 25, 29, 28].

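Definition 2. Let $(\mathfrak{E}, P)$ and $(\mathfrak{F}, Q)$ be two Polish spaces, each equipped with a probability measure, and $c$ a semi-continuous function from $\mathfrak{E} \times \mathfrak{F}$ to $\mathbf{R}^+ \cup \{\infty\}$. The optimal transportation problem, or Monge-Kantorovitch problem $\operatorname{MKP}(P, Q)$, is to find

$$\min_{\gamma \in \Sigma(P, Q)} \int_{\mathfrak{E} \times \mathfrak{F}} c(x, y)\, d\gamma(x, y),$$

where $\Sigma(P, Q)$ denotes the space of probability measures on $\mathfrak{E} \times \mathfrak{F}$ with first marginal $P$ and second marginal $Q$.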

Said otherwise, in a more probabilistic way, it amounts to finding the coupling between $P$ and $Q$ which minimizes the cost, i.e. to constructing on the same probability space two random variables $X$ and $Y$, of respective distributions $P$ and $Q$, which minimize $E[c(X, Y)]$ among all the possible constructions. The usual cost functions are of the type $c(x, y) = \operatorname{dist}(x, y)^p$, where $\operatorname{dist}$ is a distance and $p$ a positive real number. For the Euclidean distance and $p = 2$, we can construct the so-called Wasserstein distance by considering

$$W(P, Q) = \left( \min_{\gamma \in \Sigma(P, Q)} \int |x - y|^2\, d\gamma(x, y) \right)^{1/2}.$$

The distances seen so far are not unrelated, as many functional inequalities exist between them. Just to mention two examples, the Pinsker inequality states that the total variation distance is controlled by the Kullback-Leibler distance:

$$D_{|t-1|}(P, Q) \le \sqrt{2\, D_{t \ln t}(P, Q)}.$$
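As a quick numerical illustration of these definitions, the following Python sketch evaluates three $f$-divergences on a finite state space and checks the Pinsker inequality, assuming its usual form $D_{|t-1|} \le \sqrt{2 D_{t \ln t}}$. The function names and the two toy distributions are illustrative assumptions, not taken from the text.

```python
import math

def f_divergence(p, q, f):
    """D_f(P, Q) = sum_x f(p(x)/q(x)) q(x) on a finite set; infinite if P is not << Q."""
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px > 0.0:
                return math.inf  # P charges a point that Q does not
            continue
        total += f(px / qx) * qx
    return total

def kullback_leibler(t):
    return t * math.log(t) if t > 0.0 else 0.0  # convention 0 ln 0 = 0

def hellinger(t):
    return (math.sqrt(t) - 1.0) ** 2

def total_variation(t):
    return abs(t - 1.0)

# Two hypothetical distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_kl = f_divergence(p, q, kullback_leibler)
d_h = f_divergence(p, q, hellinger)
d_tv = f_divergence(p, q, total_variation)

print("Kullback-Leibler:", d_kl)
print("Hellinger:       ", d_h)
print("Total variation: ", d_tv)
print("Pinsker holds:   ", d_tv <= math.sqrt(2.0 * d_kl))
```

With $f(t) = |t - 1|$ the divergence is the $L^1$ distance between the two probability vectors, which is why the factor 2 appears under the square root.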

On the other hand, the so-called HWI inequality (see [28]) relates the relative entropy (H), the Wasserstein distance (W) and the Fisher information (I) as follows.

Theorem 2. Let $P$ and $Q$ be two probability measures on $\mathbf{R}^n$ such that $P = \exp(-V)\, dx$ with $\operatorname{Hess} V \ge K\, \mathrm{Id}_n$. Then,

$$D_{t \ln t}(P, Q) \le W(P, Q) \sqrt{I(P, Q)} - \frac{K}{2}\, W(P, Q)^2,$$

where $I$ denotes the Fisher information.

These examples are here only to give a glimpse of the vast subject of the relationships between all these notions of distance. However, this is not the true subject of the present paper. The theorem which justifies the sequel is known as the Kantorovitch-Rubinstein theorem (see [13, 14]) and says the following.

Theorem 3. For $P$ and $Q$ two probability measures on a Polish space $\mathfrak{E}$, consider the Monge-Kantorovitch problem for a cost function $c$ which is a distance on $\mathfrak{E}$. Then, we have the following representation

$$\min_{\gamma \in \Sigma(P, Q)} \int_{\mathfrak{E} \times \mathfrak{E}} c(x, y)\, d\gamma(x, y) = \sup_{F \in \operatorname{Lip}_c(1)} \left( E_P[F] - E_Q[F] \right),$$

where $F \in \operatorname{Lip}_c(1)$ means that $F$ is $c$-Lipschitz continuous: $|F(x) - F(y)| \le c(x, y)$ for all $x, y \in \mathfrak{E}$. The resulting distance between $P$ and $Q$ will be called henceforth the Kantorovitch-Rubinstein distance, as in [28].
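The duality can be made concrete on the real line with the cost $c(x, y) = |x - y|$. The Python sketch below relies on two standard one-dimensional facts that are not stated in the text and are used here as assumptions: for two empirical measures with the same number of atoms, the monotone (sorted) coupling is optimal, and the optimal cost equals the $L^1$ distance between the distribution functions. The sample points are arbitrary.

```python
def transport_cost_sorted(xs, ys):
    """Cost of the monotone coupling between two empirical measures of equal size."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def cdf_distance(xs, ys):
    """Integral of |F_P - F_Q| over the real line, for empirical distribution functions."""
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        f_p = sum(1 for a in xs if a <= left) / len(xs)
        f_q = sum(1 for b in ys if b <= left) / len(ys)
        total += abs(f_p - f_q) * (right - left)
    return total

def dual_value(test_function, xs, ys):
    """E_P[F] - E_Q[F] for a given test function F."""
    return sum(map(test_function, xs)) / len(xs) - sum(map(test_function, ys)) / len(ys)

xs = [0.0, 1.0, 3.0]   # atoms of P, each of mass 1/3
ys = [0.5, 2.0, 2.5]   # atoms of Q, each of mass 1/3

primal = transport_cost_sorted(xs, ys)
via_cdf = cdf_distance(xs, ys)
# F(x) = -x is 1-Lipschitz, so it gives a lower bound on the supremum.
lower_bound = dual_value(lambda x: -x, xs, ys)

print(primal, via_cdf, lower_bound)
```

Any 1-Lipschitz test function gives a lower bound on the transport cost; the theorem says that the best such bound is attained.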
This formulation of a distance motivates alternative definitions by changing the set of test functions. For instance, for $\mathcal{F} = \{\mathbf{1}_{(-\infty; x]},\ x \in \mathbf{R}\}$,

$$\sup_{F \in \mathcal{F}} |E_P[F] - E_Q[F]|$$

is the Kolmogorov distance, while taking for $\mathcal{F}$ the indicators of all Borel sets yields the total-variation distance. It turns out that Stein's method is particularly well suited to estimate this kind of distance, as we shall see now.

3. Stein’s method 

Historically, Stein's method for the Gaussian distribution dates back to the seminal paper of Stein [27]. It was soon extended to the Poisson distribution in the paper of Chen [7]. It is impossible to track all the subsequent extensions of this approach, made mainly by A. Barbour and his collaborators, to several other distributions like compound Poisson [5], Poisson point processes [30], stationary measures of birth-death processes, even Brownian motion [2]. For a thorough account of this period, one may refer to the books [3, 4] and references therein. The main breakthrough came with the paper of Nourdin and Peccati [21], in which it is shown that, by combining Malliavin calculus and Stein's approach, one can obtain a rather simple proof of the striking fourth moment theorem, established earlier in [22]. This was the starting point of a series of articles with a wide area of applications: rate of convergence in the central limit theorem, Berry-Esseen theorem, iterated-logarithm theorem, limit theorems on manifolds, etc.

3.1. Dirichlet-Malliavin structure. The procedure of Stein's method can be abstracted within the setting of Dirichlet structures (for details, we refer to [6, 16, 20]). The subsequent explanations are at a very formal level, since the hard part for this machinery to work is to find the convenient functional spaces for each application.

The first idea underlying Stein's method is to characterize the target measure by an algebraic equation: find a functional operator $L$ on a set $\mathcal{F}$ of test functions such that $E_Q[LF] = 0$ for any $F$ in $\mathcal{F}$ if and only if $Q = P$. It turns out that this functional operator $L$ can be viewed as the (infinitesimal) generator of a Markovian semi-group, which we denote by $(P_t, t \ge 0)$, whose stationary measure is $P$: the image measure of $P$ by $P_t$ is still $P$ for any $t \ge 0$. Under some technical hypotheses, there exists a strong ergodic Markov process $X = (X(t), t \ge 0)$ of invariant measure $P$ and of generator $L$. It must be noted that the knowledge of one of $L$, $(P_t)$ or $X$ is equivalent to the knowledge of the other two. Formally speaking, with the sign convention of the Gaussian example below (for which $L$ is a non-negative operator), for any $x \in \mathfrak{E}$,

$$P_t f(x) = e^{-tL} f(x), \quad Lf(x) = -\left.\frac{d}{dt} P_t f(x)\right|_{t=0}, \quad P_t f(x) = E[f(X(t)) \mid X(0) = x].$$

One can also associate to $X$ the so-called Dirichlet form, defined formally by

$$\mathcal{E}(F, G) = E_P[LF\, G],$$

for any $F$ and $G$ sufficiently regular. As before, if we are given such a bilinear form $\mathcal{E}$, one can retrieve $L$ by the following relationship: for any $F$, $LF$ is the unique element $H$ such that for any $G$, $\mathcal{E}(F, G) = E_P[HG]$. This means that whichever of $L$, $X$, $(P_t)$ or $\mathcal{E}$ we are given, the others are uniquely determined (the reader is referred to the particularly illuminating Diagram 2, page 36 of [20]).

Within this framework, it is easy to see that the Stein-Dirichlet representation formula holds: for any bounded $F$,

$$E_Q[F] - E_P[F] = E_Q\left[\int_0^\infty L P_t F\, dt\right]. \tag{1}$$
This formula is also known as the semi-group method or the smart-path formula in the Stein's method literature. This means that we can write

$$\operatorname{dist}_{\mathcal{F}}(P, Q) = \sup_{F \in \mathcal{F}} |E_P[F] - E_Q[F]| = \sup_{F \in \mathcal{F}} \left| E_Q\left[\int_0^\infty L P_t F\, dt\right] \right|.$$

Instead of using coupling arguments to estimate this right-hand side, as usually done in Stein's method, we use another functional operator, which is the gradient in the sense of Malliavin. It is usually denoted by $D$ and satisfies the identity $L = D^*D$, where $D^*$ is the adjoint of $D$. This is a kind of square root of the symmetric operator $L$, but not all square roots are interesting, as we also need a nice commutation relationship between $D$ and $P_t$. A few examples are the best way to illustrate what we mean.

3.2. One dimensional examples. If $P$ denotes the standard Gaussian measure on $\mathbf{R}$, then $X$ is the Ornstein-Uhlenbeck process defined by

$$dX(t) = \sqrt{2}\, dB(t) - X(t)\, dt, \quad X(0) = x,$$

where $B$ is a standard one-dimensional Brownian motion. A straightforward application of the Itô formula gives the following expression of $X$:

$$X(t) = e^{-t} x + \sqrt{2} \int_0^t e^{-(t-s)}\, dB(s).$$

It is then easy to see that $X(t) \sim \mathcal{N}(e^{-t} x,\ 1 - e^{-2t})$ which, in turn, entails the Mehler representation formula:

$$P_t F(x) = \int_{\mathbf{R}} F\left(e^{-t} x + \sqrt{1 - e^{-2t}}\, y\right) dP(y).$$

It follows by differentiation and integration by parts that for $F \in C^2$,

$$LF(x) = x F'(x) - F''(x), \text{ for all } x \in \mathbf{R}.$$

The Malliavin gradient is the usual derivative operator and standard computations show that

$$\int_{\mathbf{R}} DF(x)\, G(x)\, dP(x) = \int_{\mathbf{R}} F(x) \left( x G(x) - DG(x) \right) dP(x),$$

hence that $D^*G(x) = x G(x) - DG(x)$ and $L = D^*D$. Moreover, we have $DP_tF(x) = e^{-t} P_t DF(x)$, which is the commutation relationship alluded to above.
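These Gaussian identities are easy to test numerically. The following Python sketch is only an illustration under stated assumptions: it evaluates the Mehler formula by Gauss-Hermite quadrature (NumPy's `hermegauss`, whose weight is $e^{-y^2/2}$), takes the test function $F = \cos$, and checks both the commutation relationship $DP_tF = e^{-t} P_t DF$ and the characterizing identity $E_P[LF] = 0$. The choice of $F$, of the point $x$ and of the time $t$ is arbitrary.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for the weight exp(-y^2 / 2), renormalized to the standard Gaussian law.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)

def gaussian_mean(g):
    """E_P[g] for P the standard Gaussian measure."""
    return float(np.sum(weights * g(nodes)))

def mehler(F, t):
    """Return x -> P_t F(x), computed with the Mehler formula."""
    a = np.exp(-t)
    b = np.sqrt(1.0 - np.exp(-2.0 * t))
    return lambda x: float(np.sum(weights * F(a * x + b * nodes)))

F = np.cos
dF = lambda x: -np.sin(x)
d2F = lambda x: -np.cos(x)

t, x, h = 0.7, 0.3, 1e-5
PtF = mehler(F, t)

# Left-hand side: derivative of P_t F, by a centered finite difference.
lhs = (PtF(x + h) - PtF(x - h)) / (2.0 * h)
# Right-hand side: exp(-t) P_t (F').
rhs = np.exp(-t) * mehler(dF, t)(x)

# E_P[L F] with L F(y) = y F'(y) - F''(y) should vanish.
stein_mean = gaussian_mean(lambda y: y * dF(y) - d2F(y))

print(lhs, rhs, stein_mean)
```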

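If $P$ represents the Poisson measure on $\mathbf{N}$ of parameter $\lambda$, the process $X$ can be viewed as the number of occupied servers in an M/M/$\infty$ queue (see [11]), and $L$ is the corresponding generator:

$$LF(x) = \lambda \left( F(x + 1) - F(x) \right) + x \left( F(x - 1) - F(x) \right), \text{ for all } x \in \mathbf{N},$$

with the convention that $0 \cdot F(-1) = 0$. The gradient is defined by

$$DF(x) = F(x + 1) - F(x),$$

and we have $DP_tF = e^{-t} P_t DF$ (see [11, Theorem 11.16] or [12]). For the scalar product in $L^2(P)$, we have

$$\int_{\mathbf{N}} DF(x)\, G(x)\, dP(x) = \int_{\mathbf{N}} F(x) \left( \frac{x}{\lambda} G(x - 1) - G(x) \right) dP(x). \tag{2}$$

Hence,

$$D^*G(x) = \frac{x}{\lambda} G(x - 1) - G(x) \quad \text{and} \quad L = -\lambda\, D^*D,$$

that is, with $L$ written as above, $L$ coincides with $D^*D$ up to the sign and the normalizing factor $\lambda$.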
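The Poisson structure can be checked in the same spirit. The Python sketch below is an illustration with arbitrary choices of $\lambda$, of the truncation level and of the test functions $F$ and $G$. It verifies the integration by parts formula (2) by direct summation, and then compares the generator $L$, taken exactly as printed above, with $D^*D$: with these formulas the computation gives $\lambda D^*DF = -LF$, so the two operators agree up to sign and normalization.

```python
import math

lam = 3.0
N = 80  # truncation level: the Poisson(3) mass beyond 80 is negligible

def pmf(x):
    """Poisson(lam) probability of x, computed through logarithms."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def D(F):
    """Discrete gradient DF(x) = F(x + 1) - F(x)."""
    return lambda x: F(x + 1) - F(x)

def D_star(G):
    """Adjoint read off from (2); the term (x / lam) G(x - 1) vanishes at x = 0."""
    return lambda x: (x / lam) * G(x - 1) - G(x) if x > 0 else -G(x)

def L(F):
    """Generator as printed in the text, with the convention 0 * F(-1) = 0."""
    return lambda x: lam * (F(x + 1) - F(x)) + (x * (F(x - 1) - F(x)) if x > 0 else 0.0)

def mean(H):
    """E_P[H] for P the Poisson(lam) measure, truncated at N."""
    return sum(H(x) * pmf(x) for x in range(N))

F = lambda x: math.sin(0.5 * x)
G = lambda x: math.cos(0.3 * x)

lhs = mean(lambda x: D(F)(x) * G(x))        # left-hand side of (2)
rhs = mean(lambda x: F(x) * D_star(G)(x))   # right-hand side of (2)

# Largest value of |lam * D*D F + L F| over the first few integers.
generator_gap = max(abs(lam * D_star(D(F))(x) + L(F)(x)) for x in range(20))

# The Poisson measure is stationary: E_P[L F] = 0.
stationarity = mean(L(F))

print(lhs, rhs, generator_gap, stationarity)
```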

We now show how these constructions fit together to give a new approach to Stein's method.

It is well known that for $Z_\lambda$ a Poisson random variable of parameter $\lambda$,

$$\tilde{Z}_\lambda = \frac{Z_\lambda - \lambda}{\sqrt{\lambda}} \xrightarrow{\lambda \to \infty} \mathcal{N}(0, 1) \text{ in distribution.}$$

We are going to use the Stein-Dirichlet-Malliavin method to evaluate the rate of convergence. We are in a situation where the target measure is defined on $\mathbf{R}$ whereas the initial randomness comes from a probability measure on $\mathbf{N}$. The map $T$ defined by

$$T : \mathfrak{E} = \mathbf{N} \longrightarrow \mathfrak{F} = \mathbf{R}, \quad n \longmapsto \frac{n - \lambda}{\sqrt{\lambda}}$$

maps one space to the other, and we have to evaluate the distance between $T^*Q_\lambda$, the image measure of $Q_\lambda$, the Poisson($\lambda$) probability, by the map $T$, and $P$, the standard normal distribution on $\mathbf{R}$. This is a particular case of the general situation illustrated in Figure 2.


[Diagram: the map $T$ from the initial space to the target space.]

Figure 2. Comparison between a measure $P$ and $T^*Q$.


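In view of (1), we have to estimate

$$\sup_{F \in \mathcal{F}} \int_0^\infty \int_{\mathbf{R}} \left( x\, (P_tF)'(x) - (P_tF)''(x) \right) dT^*Q_\lambda(x)\, dt,$$

where $(P_t, t \ge 0)$ is the Ornstein-Uhlenbeck semi-group given by the Mehler formula above and $\mathcal{F}$ is a functional space to be conveniently chosen. According to the definition of $T$, the quantity to maximize is equal to, with $\tilde{Z}_\lambda = T(Z_\lambda)$,

$$E\left[ \int_0^\infty \left( \tilde{Z}_\lambda\, (P_tF)'(\tilde{Z}_\lambda) - (P_tF)''(\tilde{Z}_\lambda) \right) dt \right].$$

Applying (2) to $G = 1$ and $F \circ T$, we get

$$\sqrt{\lambda}\; E\left[ F\left( \tilde{Z}_\lambda + \frac{1}{\sqrt{\lambda}} \right) - F(\tilde{Z}_\lambda) \right] = E\left[ \tilde{Z}_\lambda\, F(\tilde{Z}_\lambda) \right].$$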

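Hence, applying this identity to $(P_tF)'$, the quantity to maximize becomes

$$E\left[ \int_0^\infty \left( \sqrt{\lambda} \left[ (P_tF)'\left( \tilde{Z}_\lambda + \frac{1}{\sqrt{\lambda}} \right) - (P_tF)'(\tilde{Z}_\lambda) \right] - (P_tF)''(\tilde{Z}_\lambda) \right) dt \right]. \tag{3}$$

For any $t > 0$, the regularizing properties of $P_t$ entail that $P_tF$ is thrice differentiable. Hence,

$$(P_tF)'\left( \tilde{Z}_\lambda + \frac{1}{\sqrt{\lambda}} \right) - (P_tF)'(\tilde{Z}_\lambda) = \frac{1}{\sqrt{\lambda}} (P_tF)''(\tilde{Z}_\lambda) + \frac{1}{\lambda} \int_0^1 (1 - r)\, (P_tF)^{(3)}\left( \tilde{Z}_\lambda + \frac{r}{\sqrt{\lambda}} \right) dr. \tag{4}$$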

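And then, a miracle occurs: the term involving the second order derivative vanishes and we are led to maximize

$$\frac{1}{\sqrt{\lambda}}\, E\left[ \int_0^\infty \int_0^1 (1 - r)\, (P_tF)^{(3)}\left( \tilde{Z}_\lambda + \frac{r}{\sqrt{\lambda}} \right) dr\, dt \right] \tag{5}$$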

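for $F$ over $\mathcal{F}$. There is now a delicate point. If $F$ is differentiable, we already mentioned that

$$(P_tF)'(x) = e^{-t} P_t(F')(x).$$

Furthermore, by integration by parts with respect to the Gaussian measure, it is easy to see that

$$(P_tF)^{(k)}(x) = \left( \frac{e^{-t}}{\sqrt{1 - e^{-2t}}} \right)^{k} \int_{\mathbf{R}} F\left( e^{-t} x + \sqrt{1 - e^{-2t}}\, y \right) \mathcal{H}_k(y)\, dP(y),$$

whenever $F$ is bounded, for any $k \ge 1$, where $\mathcal{H}_k$ is the $k$-th Hermite polynomial. At first glance, it seems easy to bound (5) by using the previous formula for $k = 3$. Unfortunately, the term $\exp(-kt)(1 - \exp(-2t))^{-k/2}$ is integrable over $[0, +\infty)$ only for $k = 1$. Hence, we must choose $\mathcal{F} = \{F \in C_b^2,\ \|F\|_{C_b^2} \le 1\}$ and then we have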


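$$(P_tF)^{(3)}(x) = \frac{e^{-3t}}{\sqrt{1 - e^{-2t}}} \int_{\mathbf{R}} F''\left( e^{-t} x + \sqrt{1 - e^{-2t}}\, y \right) y\, dP(y) \le \frac{e^{-3t}}{\sqrt{1 - e^{-2t}}} \int_{\mathbf{R}} |y|\, dP(y).$$

Plugging this inequality into (5), we get

$$\sup_{\|F\|_{C_b^2} \le 1} \left| E\left[ F(\tilde{Z}_\lambda) \right] - \int_{\mathbf{R}} F\, dP \right| \le \frac{1}{2\sqrt{\lambda}} \int_0^\infty \frac{e^{-3t}}{\sqrt{1 - e^{-2t}}}\, dt \int_{\mathbf{R}} |y|\, dP(y) = \frac{1}{4} \sqrt{\frac{\pi}{2\lambda}}. \tag{6}$$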
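The constant in front of $\lambda^{-1/2}$ can be evaluated by hand or numerically. The Python sketch below assumes that the bound is the product of three factors: the Taylor weight $\int_0^1 (1 - r)\, dr$, the time integral $\int_0^\infty e^{-3t} (1 - e^{-2t})^{-1/2}\, dt$ and the Gaussian moment $\int |y|\, dP(y) = \sqrt{2/\pi}$. The successive changes of variable $u = e^{-t}$ and $u = \sin\theta$ turn the time integral into $\int_0^{\pi/2} \sin^2\theta\, d\theta$, which removes the singularity at $t = 0$. The helper `midpoint` is a hypothetical name for a plain midpoint rule.

```python
import math

def midpoint(g, a, b, n=20000):
    """Midpoint rule for the integral of g over [a, b] with n sub-intervals."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

# Integral over t of exp(-3t) / sqrt(1 - exp(-2t)), after u = exp(-t), u = sin(theta).
time_integral = midpoint(lambda theta: math.sin(theta) ** 2, 0.0, math.pi / 2.0)

# Weight coming from the integral remainder of the Taylor formula.
taylor_weight = midpoint(lambda r: 1.0 - r, 0.0, 1.0)

# First absolute moment of the standard Gaussian measure.
abs_moment = math.sqrt(2.0 / math.pi)

constant = taylor_weight * time_integral * abs_moment

print("time integral:", time_integral, "(pi/4 =", math.pi / 4.0, ")")
print("constant in front of lambda^(-1/2):", constant)
```

Under these assumptions the time integral equals $\pi/4$ and the constant is $\frac{1}{4}\sqrt{\pi/2} \approx 0.313$.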

Hence we have established the rate of convergence for the Kantorovitch-Rubinstein distance associated to $\mathcal{F} = \{F \in C_b^2,\ \|F\|_{C_b^2} \le 1\}$. In dimension 1, for Gaussian approximation, we could have used $LF(x) = xF(x) - F'(x)$ as a characterizing operator and thus used only 1-Lipschitz functions, with a slightly different constant in front of the factor $\lambda^{-1/2}$, namely

$$\sup_{F \in \operatorname{Lip}(1)} \left| E\left[ F(\tilde{Z}_\lambda) \right] - \int_{\mathbf{R}} F\, dP \right| \le \frac{1}{\sqrt{2\pi\lambda}}.$$

Note that this upper bound is better than the bound obtained by the classical Stein's method, where $(2\pi)^{-1/2}$ is replaced by 1. However, this line of thought is not applicable to higher dimensions.
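The rate $\lambda^{-1/2}$ can be observed directly on a single smooth test function. The following Python sketch is an illustration only: it takes $F = \sin$, for which the Gaussian expectation is zero by symmetry, computes $E[F(\tilde{Z}_\lambda)]$ by summing against the Poisson probabilities, and prints the error multiplied by $\sqrt{\lambda}$. The function name and the values of $\lambda$ are arbitrary choices, and a single test function says nothing about the supremum over a whole class.

```python
import math

def normalized_poisson_mean(F, lam):
    """E[F((Z - lam) / sqrt(lam))] for Z ~ Poisson(lam), by direct summation."""
    upper = int(lam + 20.0 * math.sqrt(lam)) + 20  # the mass beyond this point is negligible
    total = 0.0
    for n in range(upper + 1):
        log_p = -lam + n * math.log(lam) - math.lgamma(n + 1)
        total += math.exp(log_p) * F((n - lam) / math.sqrt(lam))
    return total

F = math.sin
gaussian_mean = 0.0  # E[sin(N)] = 0 for a standard Gaussian N, by symmetry

scaled_errors = {
    lam: math.sqrt(lam) * abs(normalized_poisson_mean(F, lam) - gaussian_mean)
    for lam in (10.0, 40.0, 160.0, 640.0)
}

for lam, value in scaled_errors.items():
    print(f"lambda = {lam:6.0f}   sqrt(lambda) * error = {value:.5f}")
```

The rescaled error stays essentially constant as $\lambda$ grows, which is the signature of a convergence rate of order exactly $\lambda^{-1/2}$ for this test function.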

More generally, the recipe of the Stein-Dirichlet-Malliavin method is the following.

• Characterize the target measure as the stationary distribution of an ergodic Markov process,

• Construct the two Dirichlet-Malliavin structures on both the initial and target spaces,

• Perform an integration by parts on the initial space (see (3)),

• Replace the gradient on the initial space by a function of the gradient on the target space (this is done here by the Taylor formula (4)), at the price of additional terms to be controlled,

• Finish the computations in the target space using the commutation relationship $DP_t = e^{-t} P_t D$.

3.3. Higher dimensions. This procedure can be generalized to any dimension provided that we have Dirichlet-Malliavin structures on both the initial and the target spaces. For the Gaussian measure in dimension $d$, the generator is given by

$$LF(x) = x \cdot DF(x) - \Delta F(x), \text{ for all } x \in \mathbf{R}^d, \tag{7}$$

where $D$ is the usual gradient in $\mathbf{R}^d$ and $\Delta$ is the Laplacian operator. The Mehler formula stays formally the same, with an integral over $\mathbf{R}^d$ instead of $\mathbf{R}$, and $X$ is the $\mathbf{R}^d$-valued process composed of $d$ independent copies of the one dimensional Ornstein-Uhlenbeck process. The Malliavin gradient is still the usual gradient and the commutation relationship between $D$ and $P_t$ is easily seen to hold again. We can then retrieve the results of [23].

Real difficulties arise when we try to generalize this approach to infinite dimensional spaces like the Wiener space. It is tempting to define $L$ formally as in (7), replacing the Laplacian by the trace of $D \circ D$. Unfortunately, for this trace term to exist, we need to restrict the space $\mathcal{F}$ of test functions and to choose conveniently the space $\mathfrak{F}$. There are actually two papers which address this problem. In both of them [8, 26], despite apparent dissimilarities, we end up considering for $\mathfrak{F}$ a Hilbert space with a Gaussian measure.

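Let us show how it works on an example. For $N_\lambda$ a Poisson process on $\mathbf{R}^+$ of intensity $\lambda$, it is known that

$$\tilde{N}_\lambda(t) = \frac{N_\lambda(t) - \lambda t}{\sqrt{\lambda}} \xrightarrow{\lambda \to \infty} B(t) \text{ in distribution,}$$

where $B$ is a standard Brownian motion and the convergence is understood to hold in $\mathbb{D}$, the Skorohod space of rcll functions. Comparing the two distributions requires finding a common Hilbert space which supports both the distribution of $B$ and that of $\tilde{N}_\lambda$. In principle, any Sobolev-like space should do. In [8], we chose the so-called Besov-Liouville space $\mathcal{I}_{\beta,2}$, for $\beta < 1/2$, defined by

$$\mathcal{I}_{\beta,2} = \left\{ f,\ \exists \dot{f} \in L^2([0, 1]) \text{ such that } f(x) = I^{\beta} \dot{f}(x) \right\}.$$

It is a Hilbert space when equipped with the scalar product $\langle f, g \rangle_{\beta,2} = \langle \dot{f}, \dot{g} \rangle_{L^2}$. The Wiener measure on this space, denoted by $\mu_\beta$, is defined by

$$E_{\mu_\beta}\left[ \exp\left( i \langle \eta, \omega \rangle_{\beta,2} \right) \right] = \exp\left( -\frac{1}{2} \langle V_\beta \eta, \eta \rangle_{\beta,2} \right),$$

where $I^{\beta}$ is the Riemann-Liouville fractional integral of order $\beta$,

$$I^{\beta} f(x) = \frac{1}{\Gamma(\beta)} \int_0^x (x - y)^{\beta - 1} f(y)\, dy,$$

and the covariance operator $V_\beta$ is a composition of fractional integral operators whose explicit expression is given in [8].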

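The Ornstein-Uhlenbeck semi-group on $(\mathcal{I}_{\beta,2}, \mu_\beta)$ is defined, for any $F$ sufficiently integrable with respect to $\mu_\beta$, by

$$P_t^{\beta} F(u) := \int_{\mathcal{I}_{\beta,2}} F\left( e^{-t} u + \sqrt{1 - e^{-2t}}\, v \right) d\mu_\beta(v).$$

The gradient is the Fréchet gradient on $\mathcal{I}_{\beta,2}$ and all the other properties still hold formally as in finite dimension.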
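As initial space, we consider $\mathfrak{E} = \mathfrak{N}$, the space of locally finite configurations on $\mathbf{R}^+$ equipped with the vague topology. The measure $Q_\lambda$ is such that the canonical process, denoted by $N_\lambda$, is a Poisson process of intensity $\lambda$; for details we refer to [8]. On the initial space, we actually only need to know the gradient and an integration by parts formula. Here, we take

$$D_x F(N_\lambda) = F(N_\lambda + \delta_x) - F(N_\lambda),$$

where $N_\lambda + \delta_x$ is the configuration $N_\lambda$ with an additional atom at location $x$. The well-known Campbell-Mecke formula ([18, 19]) is equivalent to saying that

$$E_{Q_\lambda}\left[ F \int_0^1 G_\tau \left( dN_\lambda(\tau) - \lambda\, d\tau \right) \right] = \lambda\, E_{Q_\lambda}\left[ \int_0^1 D_\tau F\, G_\tau\, d\tau \right]$$

for $G$ a deterministic process. The map $T$ is defined by

$$T : \mathfrak{N} \longrightarrow \mathcal{I}_{\beta,2}, \quad N_\lambda \longmapsto \left( t \mapsto \frac{N_\lambda(t) - \lambda t}{\sqrt{\lambda}} \right).$$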

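Proceeding exactly along the same lines as before, one can show that there exists $c_\beta > 0$ such that

$$\sup_{F \in C_b^2(\mathcal{I}_{\beta,2};\, \mathbf{R})} \left| E_{T^*Q_\lambda}[F] - E_{\mu_\beta}[F] \right| \le \frac{c_\beta}{\sqrt{\lambda}}, \tag{8}$$

where $C_b^2(\mathcal{I}_{\beta,2}; \mathbf{R})$ is the set of twice Fréchet differentiable functionals on $\mathcal{I}_{\beta,2}$ with bounded differentials. This is the generalization we could expect of (6).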

Other examples of the application of this procedure, involving other functional 
spaces, can be found in the papers [8, 12]. A similar approach with Malliavin 
calculus replaced by a coupling argument appears in [10]. 


4. Edgeworth expansion 

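Stein's method as developed here can be iterated to obtain Edgeworth expansions. We now want to refine the expansion obtained in (6). To do so, we go one step further in the Taylor formula (4): for some $\theta \in (0, 1)$,

$$\psi\left( \tilde{Z}_\lambda + \frac{1}{\sqrt{\lambda}} \right) - \psi(\tilde{Z}_\lambda) = \frac{1}{\sqrt{\lambda}} \psi'(\tilde{Z}_\lambda) + \frac{1}{2\lambda} \psi''(\tilde{Z}_\lambda) + \frac{1}{6\lambda\sqrt{\lambda}} \psi^{(3)}\left( \tilde{Z}_\lambda + \frac{\theta}{\sqrt{\lambda}} \right).$$

Hence,

$$E\left[ \tilde{Z}_\lambda\, DP_tF(\tilde{Z}_\lambda) - D^{(2)}P_tF(\tilde{Z}_\lambda) \right] = \frac{1}{2\sqrt{\lambda}} E\left[ D^{(3)}P_tF(\tilde{Z}_\lambda) \right] + \frac{1}{6\lambda} E\left[ D^{(4)}P_tF\left( \tilde{Z}_\lambda + \frac{\theta}{\sqrt{\lambda}} \right) \right]. \tag{9}$$

If $F$ is thrice differentiable with bounded derivatives then $P_tF$ is four times differentiable, hence the last term of (9) is bounded by $\lambda^{-1} \|D^{(4)}P_tF\|_\infty / 6$. Moreover, applying (6) to $D^{(3)}P_tF$ shows that

$$E\left[ D^{(3)}P_tF(\tilde{Z}_\lambda) \right] = E_P\left[ D^{(3)}P_tF \right] + O(\lambda^{-1/2}).$$

Combining the last two results, we obtain that for $F$ thrice differentiable

$$E\left[ F(\tilde{Z}_\lambda) \right] - E_P[F] = \frac{1}{2\sqrt{\lambda}} E_P\left[ \int_0^\infty D^{(3)}P_tF\, dt \right] + O(\lambda^{-1}).$$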
This line of thought can be pursued at any order provided that $F$ is assumed to have sufficient regularity, and we get an Edgeworth expansion up to any power of $\lambda^{-1/2}$. Using the properties of Hermite polynomials, this leads to the expansion:

$$E\left[ F(\tilde{Z}_\lambda) \right] - E_P[F] = \frac{1}{6\sqrt{\lambda}} E_P\left[ F\, \mathcal{H}_3 \right] + O(\lambda^{-1}),$$

where $\mathcal{H}_n$ is the $n$-th Hermite polynomial. In [9], we generalized this approach to the Poisson process-Brownian motion convergence established in (8).
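The first-order expansion can be tested numerically. The Python sketch below is an illustration under stated assumptions: the correction term is taken to be $\frac{1}{6\sqrt{\lambda}} E_P[F \mathcal{H}_3]$ with $\mathcal{H}_3(y) = y^3 - 3y$, Gaussian expectations are computed by Gauss-Hermite quadrature (NumPy's `hermegauss`), and the test function $F = \sin + \cos$ and the values of $\lambda$ are arbitrary. If the remainder is indeed $O(\lambda^{-1})$, then $\lambda$ times the remainder should stay bounded.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for the standard Gaussian measure P.
nodes, weights = hermegauss(60)
weights = weights / math.sqrt(2.0 * math.pi)

def gaussian_mean(g):
    """E_P[g] for the standard Gaussian measure P."""
    return float(np.sum(weights * g(nodes)))

def normalized_poisson_mean(F, lam):
    """E[F((Z - lam) / sqrt(lam))] for Z ~ Poisson(lam), by direct summation."""
    n = np.arange(0, int(lam + 20.0 * math.sqrt(lam)) + 21)
    log_p = -lam + n * math.log(lam) - np.array([math.lgamma(k + 1.0) for k in n])
    return float(np.sum(np.exp(log_p) * F((n - lam) / math.sqrt(lam))))

F = lambda x: np.sin(x) + np.cos(x)
H3 = lambda y: y ** 3 - 3.0 * y  # third Hermite polynomial

# Coefficient of lambda^(-1/2) in the expansion: E_P[F H_3] / 6.
first_order = gaussian_mean(lambda y: F(y) * H3(y)) / 6.0

def remainder(lam):
    """Error left after subtracting the Gaussian mean and the first-order term."""
    return normalized_poisson_mean(F, lam) - gaussian_mean(F) - first_order / math.sqrt(lam)

scaled_remainders = [lam * remainder(lam) for lam in (10.0, 40.0, 160.0)]

print("first-order coefficient:", first_order)
print("lambda * remainder:", scaled_remainders)
```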


5. Conclusion 

We showed how Stein's method can be abstracted in the framework of Dirichlet forms and Malliavin calculus. This gives rise to a new method of proof which can be applied to infinite dimensional spaces and iterated to get Edgeworth expansions. One open question is to apply this approach to other limiting processes like stable or max-stable processes, Brownian bridges, etc.